Abstract

Mutations in the BRCA2 tumor suppressor protein leave individuals susceptible
to breast, ovarian and other cancers. The BRCA2 protein is a critical
component of the DNA repair pathways in eukaryotes, and also plays an
integral role in fostering genomic variability through meiotic recombination.
Although present in many eukaryotes, as a whole the BRCA2 gene is weakly
conserved. Conserved fragments of 30 amino acids (BRC repeats), which
mediate interactions with the recombinase RAD51, helped detect orthologs of
this protein in other organisms. The carboxy-terminal of the human BRCA2 has
been shown to be phosphorylated by checkpoint kinases (Chk1/Chk2) at
T3387, which regulate the sequestration of RAD51 on DNA damage. However,
apart from three BRC repeats, the Drosophila melanogaster gene has not been
annotated and associated with other functionally relevant sequence fragments
in human BRCA2. In the current work, the carboxy-terminal phosphorylation
threonine site (E=9.1e-4) and a new BRC repeat (E=17e-4) in D. melanogaster
have been identified, using a fragmented alignment methodology (FRAGAL). In
a similar study, FRAGAL has also identified a novel half-a-tetratricopeptide
(HAT) motif (E=11e-4), a helical repeat motif implicated in various aspects of
RNA metabolism, in Utp6 from yeast. The characteristic three aromatic
residues with conserved spacing are observed in this new HAT repeat, further
strengthening my claim. The reference and target sequences are sliced into
overlapping fragments of equal parameterized lengths. All pairs of fragments in
the reference and target proteins are aligned, and the gap penalties are
adjusted to discourage gaps in the middle of the alignment. The results of the
best matches are sorted based on differing criteria to aid the detection of
known and putative sequences. The source code for FRAGAL results on these
sequences is available at https://github.com/sanchak/FragalCode, while the
database can be accessed at www.sanchak.com/fragal.html.

Changes from Version 1

I would like to thank the referees for the insightful observations on 
my work. I have revised my manuscript in which I have addressed 
all the specific points raised, and believe that these have improved 
the manuscript significantly. The main changes incorporated in this 
version are summarized below: 

1. Shortened the title - A fragmented alignment method detects
a putative phosphorylation site and a putative BRC repeat in the
Drosophila melanogaster BRCA2 protein.

2. Added pseudo code of FRAGAL to the methods section. 

3. Applied FRAGAL to two new repeats - BIR and TPR. However, 
I could not detect any new repeat based on FRAGAL results. 

4. Added proteins from Aspergillus nidulans and Candida glabrata 
which have the HAT repeat in the FRAGAL processing. 

5. Used another multiple sequence alignment (MAFFT) to 
corroborate results obtained from Clustal-W. 

6. Table 1 has been simplified based on the suggestion of the 
referees. 

7. Specified the tool used for computing E-values. 

Please find my detailed responses underneath each referee report. 




Introduction 

The breast cancer susceptibility protein BRCA2, first identified in
1995, is a critical recombinase regulator that ensures genomic
stability through high fidelity repair of double stranded breaks
(DSB) and prevents stalled replication forks from replicating in the
DNA. The primary recombinase in BRCA2 repair of DSB through
homologous recombination is the RAD51 protein, belonging to the
conserved RecA/RAD51 family, that binds to the BRCA2 protein
at various segments of ~30 amino acids (BRC repeats) and in the
C-terminal region in most vertebrates. Checkpoint kinases
phosphorylate a serine and a threonine at the carboxy-terminal region
of BRCA2, thereby regulating its interaction with RAD51. BRCA2
also plays a key role in fostering genomic variability through
meiotic recombination, although a different recombinase
(DMC1) is implicated in this pathway in mammalian species.

The BRC repeats have helped identify BRCA2 orthologs in various
eukaryotic species. Functional characterization of this gene
in Drosophila melanogaster has demonstrated its interaction with
RAD51, and a critical role in mitotic and meiotic DNA repair as well
as homologous recombination. The copy number of the BRC
repeats differs considerably. The BRCA2 homolog in Ustilago maydis
(a yeast-like fungus) has a single BRC repeat, the D. melanogaster
homolog contains only three (known) repeats, while there are
eight repeats in the human BRCA2 gene. Even within the
Drosophila genus, the number of BRC repeats varies -
D. melanogaster has only three repeats, while D. persimilis
and D. pseudoobscura have up to eleven repeats. RAD51 shows
varying affinity for the different BRC motifs. This difference in
repeat numbers in Drosophila has raised doubts whether 'this higher
repeat number is real or a genome mis-assembly artifact', and also
led to speculation on the evolution of these closely related
organisms. Any such hypothesis would need to be revisited if a new
BRC motif were to be identified in D. melanogaster.



In the current work, the putative threonine phosphorylation site
for checkpoint kinases (Chk1/Chk2) (E=9.1e-4) and a new BRC
repeat (E=17e-4) in D. melanogaster have been identified, using a
fragmented technique for the pairwise alignment of two sequences
(FRAGAL). The reference and target sequences are sliced into
fragments of equal parameterized length X, sliding along the sequence
in intervals of length Y, such that Y is less than X. Thus, the slices
have overlaps. An alignment of all pairs of slices in the reference
and target proteins is done using the global alignment program
'needle' from the EMBOSS suite. The gap penalties are adjusted
to discourage gaps in the middle of the alignment. The results of
the best matches are sorted based on differing criteria to aid the
detection of known and putative sequences. In order to establish
the generic nature of the FRAGAL methodology, the detection of
a new half-a-tetratricopeptide (HAT) repeat sequence (E=11e-4) in
a nucleolar RNA-associated protein (Utp6) from Saccharomyces
cerevisiae is also reported. HAT is a helical repeat motif implicated
in various aspects of RNA metabolism. The characteristic three
aromatic residues with a conserved spacing are observed in this new
HAT repeat, further strengthening my claim.

Existing methods for detecting functional motifs in a given
protein sequence have been unable to detect these putative sites. For
example, meta servers (http://myhits.isb-sib.ch/cgi-bin/motif_scan,
http://www.ebi.ac.uk/Tools/pfa/iprscan/, http://www.genome.jp/tools/motif/)
for detecting motifs in a protein have been unable to detect
the sites identified using the FRAGAL methodology. These meta
servers use one or more motif databases. Not all known BRC
repeats have a low E-value when aligned with the new BRC repeat.
For example, the first BRC repeat in hBRCA2 when aligned to the
new dmBRCA2 repeat has an E=0.04, much more than the E=17e-4
observed for the fourth repeat, which is the one I report here. Ideally,
if one took all the BRC repeats and did a search in the dmBRCA2
sequence, this new repeat would be reported. Essentially, this is
what FRAGAL does - albeit implicitly, by automatically fragmenting
the sequence. The same logic applies to the HAT repeat, where
the sequences are more varied and thus the choice of the repeat
would affect the detection of new motifs.

Spliced alignment techniques have frequently been adopted in the
precise identification of eukaryotic gene structures, and in gene assembly.
These methods try to solve the exon assembly problem by searching
the exon sequence space to find the best fit to known proteins.
While these methods use graph algorithms to solve the computationally
difficult problem of exon chaining, FRAGAL does the converse
of finding best matches in known exon chains (i.e. protein sequences).

It is fair to mention that the FRAGAL method is much more
computationally intensive than the above mentioned methods. At the
same time, FRAGAL makes no assumption of any knowledge of
the conserved regions (either the sequence or their position). The
choice of the fragment length in FRAGAL depends on the length
of repeats that are expected to be present in the protein. Since both
repeats (BRC and HAT) discussed in this manuscript are around
30 amino acids long, I have chosen a fragment length of 50. A larger
fragment length might mask the similarity in the core region due to
variations in the non-critical regions, whereas a smaller fragment
would match irrelevant portions and thus increase false positives.

The significant conservation of the DNA repair and checkpoint
pathways in flies and higher organisms, the advanced genetic
tools available for Drosophila, and the viability of the Drosophila
BRCA2 null mutants in contrast to mammalian mutants establish
Drosophila as a model organism for studying these pathways.
Significant divergence of key conserved sequences proves to be a
serious hurdle for alignment techniques to annotate and associate
the conserved sequences in the human BRCA2 to the Drosophila
BRCA2. Thus, a generic methodology, applicable to evolutionarily
distant proteins like BRCA2 and nucleolar RNA-associated
proteins, is presented. The methodology has been
validated by the identification of two novel functionally relevant
sites in the BRCA2 protein from D. melanogaster, and a HAT repeat
in Utp6 from S. cerevisiae.

Materials and methods 

The FRAGAL methodology is shown in Supplementary Figure 1.

The sequences are split into fragments of X amino acids, with the
starting indices sliding across the sequence length in steps of Y
amino acids (SLA.fasta and SLB.fasta in Data Files). The score for
each match is computed as shown in Equation 1. FRscore is intended
to give more weight to identical residues in the alignment.



$$\mathrm{FRscore} = \tfrac{1}{3}\,\%\mathrm{onlySimilarity} + \tfrac{2}{3}\,\%\mathrm{identity} \qquad (1)$$
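As a minimal sketch of Equation 1, the weighting can be written as a small Python function. The function name and argument names are hypothetical, and both inputs are taken as percentages supplied by the caller; how %onlySimilarity is derived from the 'needle' report is not spelled out here, so that step is deliberately left out.

```python
def fr_score(pct_only_similarity: float, pct_identity: float) -> float:
    """Equation 1: one third of %onlySimilarity plus two thirds of %identity.

    Both arguments are percentages (0-100). The names are illustrative only
    and are not taken from the FRAGAL source code.
    """
    return pct_only_similarity / 3.0 + 2.0 * pct_identity / 3.0


if __name__ == "__main__":
    # Identity carries twice the weight of similarity-only positions.
    print(fr_score(30.0, 60.0))
```

The 2:1 weighting means that a fragment pair with many identical residues outranks one that is merely similar over the same number of positions.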



One sorting criterion ranks the matches by the best
average score, while another takes the cumulative score of a stretch
of fragment matches. Stretches of fragments are stitched while
ensuring that the slices in both sequences are in increasing order and
non-overlapping. The best average criterion will typically select
single fragments, while the cumulative scoring criterion will bring
forth longer conserved regions.

Algorithm 1: FRAGAL() - A fragmented alignment method

Input: A: Query sequence
Input: B: Target sequence
Input: X: Length of fragments
Input: Y: Step of the sliding window (Y < X)
Input: gapopen: Gap opening penalty
Input: gapextend: Gap extension penalty

Output: Φ_fragments: Matching fragments sorted in decreasing order of FRscore
begin
    Φ_Afrag ← FragmentSequenceIntoOverlappingSegments(A, X, Y);
    Φ_Bfrag ← FragmentSequenceIntoOverlappingSegments(B, X, Y);
    // Create priority queue, based on FRscore
    Φ_fragments ← ∅;
    for each fragment A_i in Φ_Afrag do
        for each fragment B_j in Φ_Bfrag do
            // See methods section for FRscore
            FRscore_ij = RunNeedle(A_i, B_j, gapopen, gapextend);
            Insert(Φ_fragments, FRscore_ij);
        end
    end
    return (Φ_fragments);
end
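The fragmenting and all-pairs ranking steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the FRAGAL Perl code: the function names are hypothetical, only full-length windows are emitted (how FRAGAL treats the tail of a sequence is not described here), and `toy_identity` is a stand-in scorer used so the sketch runs without 'needle'.

```python
import heapq
from typing import Callable, List, Tuple


def fragment(seq: str, length: int, step: int) -> List[Tuple[int, str]]:
    """Slice seq into windows of `length` residues starting every `step`
    residues (0-based starts). Assumption: only full-length windows are kept,
    except that a sequence shorter than `length` yields itself."""
    if not 0 < step < length:
        raise ValueError("step must be positive and smaller than length")
    last_start = max(len(seq) - length, 0)
    return [(s, seq[s:s + length]) for s in range(0, last_start + 1, step)]


def toy_identity(x: str, y: str) -> float:
    """Stand-in scorer: percent identical positions without gaps.
    FRAGAL scores pairs with 'needle' and FRscore instead."""
    return 100.0 * sum(p == q for p, q in zip(x, y)) / min(len(x), len(y))


def rank_pairs(a: str, b: str, length: int, step: int,
               score: Callable[[str, str], float],
               top: int = 10) -> List[Tuple[float, int, int]]:
    """Score every fragment pair and return the `top` best as
    (score, start_in_a, start_in_b), highest score first."""
    pairs = ((score(fa, fb), sa, sb)
             for sa, fa in fragment(a, length, step)
             for sb, fb in fragment(b, length, step))
    return heapq.nlargest(top, pairs)


if __name__ == "__main__":
    print(fragment("ABCDEFGHIJ", 4, 2))
    print(rank_pairs("AAAACCCC", "GGCCCCGG", 4, 2, toy_identity, top=1))
```

The number of pairwise alignments grows with the product of the two fragment counts, which is why the method is more computationally intensive than a single whole-sequence alignment.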

The threshold for sequence similarity for each fragment is
parameterized, and set to 30% in the default mode. A large threshold will
exclude more relevant matches, while a smaller threshold might
include more false positives. The pairwise alignment for each
fragment pair is done by the global alignment program 'needle'
from the EMBOSS suite. The parameters are set as follows
- matrix=BLOSUM62, gap penalty=25.0 and extend penalty=0.5.
The gap penalty is increased from the default value of 10 to ensure
that gaps are discouraged in the middle of the alignment. Single
deletions or insertions are rarely expected in conserved fragments.
However, once 'needle' has aligned the sequences based on this
penalty, gaps should not have a penalty. It is for this very reason
that I have introduced the FRscore as a metric to measure the quality
of the alignment, which creates a weighted score of the %identity and
%similarity (Equation 1).
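A hedged sketch of how one fragment pair could be passed to 'needle' with these parameters, and how the percentages could be read back from its report, is given below in Python. The helper names are hypothetical and the file paths are placeholders; the command uses the standard EMBOSS qualifiers (-asequence, -bsequence, -datafile, -gapopen, -gapextend, -outfile, -auto), and the parser assumes the usual '# Identity: n/m (p%)' header lines of a 'needle' report.

```python
import re
from typing import Dict, List

# Matches header lines such as "# Identity:      9/53 (17.0%)".
_STAT = re.compile(r"^#\s*(Identity|Similarity|Gaps):\s*(\d+)\s*/\s*(\d+)")


def needle_command(a_fasta: str, b_fasta: str, out_path: str,
                   gapopen: float = 25.0, gapextend: float = 0.5) -> List[str]:
    """Build the argument list for one 'needle' run (not executed here)."""
    return ["needle",
            "-asequence", a_fasta, "-bsequence", b_fasta,
            "-datafile", "EBLOSUM62",
            "-gapopen", str(gapopen), "-gapextend", str(gapextend),
            "-outfile", out_path, "-auto"]


def parse_needle_stats(report: str) -> Dict[str, float]:
    """Return identity, similarity and gaps as percentages of alignment length."""
    stats: Dict[str, float] = {}
    for line in report.splitlines():
        m = _STAT.match(line.strip())
        if m:
            stats[m.group(1).lower()] = 100.0 * int(m.group(2)) / int(m.group(3))
    return stats


if __name__ == "__main__":
    # Header values as reported for the A91-B337 alignment in Figure 1a.
    sample = """# Length: 53
# Identity:       9/53 (17.0%)
# Similarity:    27/53 (50.9%)
# Gaps:           6/53 (11.3%)
# Score: 21.5"""
    print(needle_command("a.fasta", "b.fasta", "out.needle"))
    print(parse_needle_stats(sample))
```

The command list can be handed to `subprocess.run` on a machine where EMBOSS is installed; the sketch stops short of that so it runs without external tools.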

The user is allowed to specify an annotation file for a given
protein sequence using the UniProt accession syntax (Supplementary
Figure 2). The results from FRAGAL can be filtered based on this
annotation, and this provides an easier way to manually inspect and
annotate corresponding segments in a query protein sequence.

The FRAGAL package is written in Perl on Ubuntu. Hardware
requirements are modest - all results here are from a simple
workstation (2GB RAM). The source code for FRAGAL results on
these sequences is available at https://github.com/sanchak/FragalCode,
while the database can be accessed at www.sanchak.com/fragal.html.
The multiple sequence alignment was done using
ClustalW. PHYML, which is based on the method of maximum
likelihood, has been used to generate phylogenetic trees
from these alignments. The method searches for a tree with the highest
probability or likelihood that, given a proposed model of evolution
and the hypothesized history, would give rise to the observed data
set. The alignment and cladogram images were generated using
Seaview. E-values and Z-scores have been computed using the
Protein Information Resource
(http://pir.georgetown.edu/pirwww/search/pairwise.shtml).



Data Files

BRCA2 sequence fragments and database of the output of
FRAGAL for BRCA2 and HAT repeats for different organisms
(3 Data Files): http://dx.doi.org/10.6084/m9.figshare.812563



Results 

Breast cancer susceptibility protein BRCA2 

The D. melanogaster gene (CG30169) encodes a 971 amino acid
protein (dmBRCA2, Uniprot Accession: Q9W157), and contains
three BRC repeat units (conserved sequences of ~30 amino acids
that bind to RAD51). In contrast, the human BRCA2 gene product
(hBRCA2, Uniprot Accession: P51587) is 3418 amino acids
long and contains eight BRC repeats. Further, the hBRCA2 protein
is annotated for several sites phosphorylated by checkpoint kinases,
which regulate its interaction with RAD51. FRAGAL was run
on the dmBRCA2 and hBRCA2 sequences. Table 1 shows the best
matches obtained using two different sorting criteria - best average
FRscore (see Methods) and best average % similarity - either when
the match in hBRCA2 is known to be conserved (Table 1A) based on
a user-defined input file (Supplementary Figure 2) or otherwise
(Table 1B).

Detecting the threonine phosphorylation site in the carboxy-
terminal region of dmBRCA2. Table 1 shows a significant match
(E=9.1e-4, Z-score=100) between fragment 91 of dmBRCA2 and
fragment 337 of hBRCA2, which contains the T3387 that is
phosphorylated by the checkpoint kinases Chk1 and Chk2. Z-scores
above a value of 8 are considered to be significant. The
alignment shows that the T3387 corresponds to the T926 of dmBRCA2
(Figure 1a). The conservation of this region in the Drosophila
and mammalian species is demonstrated by the multiple sequence
alignment of three organisms from each group (Figure 1b). The
highly conserved columns in the alignment are highlighted using an
asterisk, and can be used to define a Prosite motif ([ST]-E-[ST]-[ST]-
x-[ST]-x(6)-[ED]-x(4)-K-x(4)-[ST]-[ST]-[ST]-x(3)-[DE]-[DE]).
Neither this motif nor the FRAGAL alignments detected this site
in other species distant from Drosophila or mammals (Ustilago
maydis and Caenorhabditis elegans).
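To scan other sequences with such a pattern, a Prosite-style motif can be translated into a regular expression. The Python sketch below is a minimal converter under stated assumptions: the function name is hypothetical, only the common Prosite elements are handled (single residues, [..] sets, {..} exclusions, x and repeat counts), and the test sequence is synthetic, built solely to satisfy the pattern; it is not a BRCA2 fragment.

```python
import re

# One PROSITE element: x, a residue, [set] or {excluded set}, optionally (n) or (n,m).
_ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Translate a hyphen-separated PROSITE pattern into a Python regex string."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported PROSITE element: {element!r}")
        core, lo, hi = m.groups()
        if core == "x":
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # any residue except those listed
        if lo is not None:
            core += "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(core)
    return "".join(parts)


if __name__ == "__main__":
    motif = ("[ST]-E-[ST]-[ST]-x-[ST]-x(6)-[ED]-x(4)-K-x(4)-"
             "[ST]-[ST]-[ST]-x(3)-[DE]-[DE]")
    regex = prosite_to_regex(motif)
    print(regex)
    # Synthetic 30-residue sequence constructed to satisfy the pattern.
    synthetic = "SESTATAAAAAAEAAAAKAAAASTSAAADE"
    print(bool(re.search(regex, synthetic)))
```

A fixed-spacing pattern of this kind is strict: a single insertion in a target sequence is enough to prevent a match, which is one reason a motif defined on Drosophila and mammals may miss distant species.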

Detecting an additional BRC repeat in Drosophila melanogaster. 

The correct identification of the three BRC repeats in D. melanogaster
is seen by the significant FRscores of the matches A67-B152
(64), A57-B100 (60) and A75-B152 (60) (Table 1). A significant
alignment (E=17e-4, Z-score=95) between A61 and B151 (35.8%
similarity and 17% identity) (Figure 1c) was also observed. This
sequence (634-664:LDTALKRSIESSEEMRSKASKLVVVDTTMR)
is now added to the list of sequences previously studied in
the Drosophila genus. The multiple sequence alignment (obtained
using ClustalW) (Figure 2a) and phylogenetic trees (obtained
using PHYML) (Figure 2b) show that this new BRC repeat is
more related to D. willistoni than other organisms in the Drosophila
genus. A detailed molecular phylogeny of Drosophilid species
has noted that the subgenus Sophophora is 'divided into
D. willistoni and the clade of D. obscura and D. melanogaster
groups', possibly indicating the source of this BRC repeat that has
been conserved between D. willistoni and D. melanogaster. The
same inference is drawn when we use a different multiple alignment
tool like MAFFT (http://mafft.cbrc.jp/alignment/software/)
(Supplementary Figure 3). An iterative methodology, similar to
PSI-BLAST (Position-Specific Iterative Basic Local Alignment
Search Tool), can be automated to generate comprehensive motifs
spanning distant species. The conservation of many key residues in
this sequence fragment, as shown by comparing it to the sequence
logo of the Prosite BRCA2 profile (PS50138) (Figure 2c), strongly
suggests that this is a putative BRC repeat. However, it must be
emphasized that such repeats are to be considered putative until
verified experimentally.

Table 1. FRAGAL results for aligning the BRCA2
protein sequences from Drosophila melanogaster (A)
(Accession: Q9W157) and humans (B) (Accession: P51587). The
results are filtered out if the fragment in the hBRCA2 sequence
is not marked as conserved (Supplementary Figure 2). This
filtering helps in removing already annotated sequences, thus
making it easier to observe new sequences. Thus, there are some
missing ranks. Multiply the index by 10 to get the sequence starting
position in the original sequence. A91 refers to the sequence starting
at 910 in A and going till 959, since the fragmenting length is 50.
The A91-B337 match corresponds to the phosphorylation site of
checkpoint kinases in the carboxy-terminal of BRCA2, while the
match A61-B151 (not shown in Table, FRscore=53) corresponds to
the new BRC repeat identified in D. melanogaster.

Rank   FRscore   Matches
1      73.5      A91-B337
3      64.7      A87-B197
4      64.2      A67-B152
5      64        A87-B194
6      64        A65-B140
9      61.7      A57-B167, A87-B197
10     61.4      A67-B100
11     61.4      A69-B199, A74-B204
12     61.3      A89-B335
14     60.1      A75-B152
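The fragment labels used in Table 1 can be mapped back to residue ranges with a few lines of Python. This sketch follows the convention stated in the Table 1 caption (start = index x 10, fragment length 50); the function names are hypothetical, and the exact offset convention of the FRAGAL code is not stated here, so the resulting positions should be read as approximate.

```python
from typing import Tuple


def parse_match(label: str) -> Tuple[int, int]:
    """Split a label such as 'A91-B337' into its two fragment indices."""
    a_part, b_part = label.split("-")
    return int(a_part[1:]), int(b_part[1:])


def fragment_range(index: int, step: int = 10, length: int = 50) -> Tuple[int, int]:
    """Start and end position of a fragment, following the Table 1 caption:
    the start is index * step and the fragment spans `length` residues."""
    start = index * step
    return start, start + length - 1


if __name__ == "__main__":
    a_idx, b_idx = parse_match("A91-B337")
    print(fragment_range(a_idx))  # fragment of dmBRCA2
    print(fragment_range(b_idx))  # fragment of hBRCA2
```

With the default step of 10 and length of 50, neighbouring fragment indices overlap by 40 residues, which is why adjacent indices often appear together among the top matches.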

Half-a-tetratricopeptide (HAT) motif 

HAT is a helical repeat motif implicated in various aspects of RNA
metabolism and in protein-protein interactions. These repeats
are characterized by three aromatic residues with a conserved
spacing. A variable number of HAT repeats (9 to 12) are found in
different proteins. Figure 3a shows a novel HAT repeat (E=11e-4,
Z-score=116) detected in a nucleolar RNA-associated protein (Utp6)
from Saccharomyces cerevisiae (Uniprot Accession: Q02354)
by comparing it to HAT repeats from a human nucleolar RNA-
associated protein (Uniprot Accession: Q9NYH9). Q9NYH9 has
five annotated HAT repeats (121-153, 156-188, 304-335, 488-520
and 524-557), while Q02354 has three HAT repeats (87-119,
124-156 and 159-191). The new HAT sequence identified in
Q02354 (SLIMKKRTDFEHRLNSRGSSINDYIKYINYESN) is
from position 30 to 62. It can be seen from the multiple sequence
alignment that this sequence has the desired aromatic residues at
the proper spacing, a requisite for being considered a HAT repeat
(Figure 3a and b). Further, the MSA shows large variation amongst
HAT sequences even within the same organism (Figure 3b). Finally,
Figure 3b and c show that certain HAT repeats are more similar
to HAT repeats from other organisms than to other HAT repeats
in their own sequence. Supplementary Figure 4 shows the
alignment and phylogenetic tree when we include more proteins having
HAT repeats from organisms closely related to S. cerevisiae, like
Aspergillus nidulans and Candida glabrata, corroborating the large
variation among repeats even within the same organism and that
HAT repeats across organisms often show more similarity.

Database for aligning different pairs of BRCA2 

A database (www.sanchak.com/fragal.html) which lists the results
for the fragmented alignment of various proteins with BRC and
HAT repeat sequences has been created. The results have been
generated by varying two parameters - the length of the fragments
and the threshold % similarity value for a significant match in a
fragment pair. As mentioned above, the results are presented in two
formats - best cumulative score and best average score.

Figure 1, panel (a): A91 aligned to B337

# Length: 53
# Identity: 9/53 (17.0%)
# Similarity: 27/53 (50.9%)
# Gaps: 6/53 (11.3%)
# Score: 21.5

A91   1 -IILSTETSTSCALPTMERFAPKPSSTSTPLADRDLNR--SKDCTKNRQDAED
B337  1 QFISVSESTRTAPTSSEDYLRLKRRCTTSLIKEQESSQASTEECEKNKQD---

Figure 1, panel (b): multiple sequence alignment of D. simulans,
D. sechellia, D. melanogaster, human, chimpanzee and orangutan
sequences (position labels 901 for Drosophila and 3360 for human),
with T3387 in hBRCA2 marked.

Figure 1, panel (c): A61 aligned to B151

# Length: 53
# Identity: 9/53 (17.0%)
# Similarity: 19/53 (35.8%)
# Gaps: 6/53 (11.3%)
# Score: 22.0

A61   1 EFQSKETIQQNDYLVHQPNDKPTSVGLDTALKRSIESSEEMRSKASKLVV---
B151  1 ---NQLVTFQGQPERDEKIKEPTLLGFHTASGKKVKIAKESLDKVKNLFDEKE

Figure 1. Fragment alignment using 'needle' from the EMBOSS suite of previously unknown conserved, and functionally relevant,
sequences in dmBRCA2 (red for identity, green for similarity). (a) Putative phosphorylation site by checkpoint kinases in the carboxy-terminal
of hBRCA2. The threonine that is phosphorylated is highlighted (T3387 in hBRCA2 and T926 in dmBRCA2) (E=0.00091, Z-score=100). (b)
Conserved sequence in the carboxy-terminal of the BRCA2 protein sequence: checkpoint kinases Chk1 and Chk2 phosphorylate threonine
3387 in hBRCA2, which is seen to be conserved in the mammalian and Drosophila species (T926 in dmBRCA2). (c) Putative BRC repeat
identified by the similarity of fragment 61 (634-664:LDTALKRSIESSEEMRSKASKLVVVDTTMR) in D. melanogaster to the BRC4 repeat in
hBRCA2 (1517-1551) (E=0.0017, Z-score=95) (red for identity, green for similarity).
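The 'best cumulative score' format depends on stitching fragment matches into a chain whose fragments are in increasing order and non-overlapping in both sequences. A minimal dynamic-programming sketch of that idea in Python is shown below. It is an assumption about one reasonable way to do the stitching, not the FRAGAL implementation: the function name is hypothetical, matches are given as fragment indices, and `span` is the number of index steps a fragment covers (5 for a fragment length of 50 and a step of 10).

```python
from typing import List, Tuple

Match = Tuple[int, int, float]  # (fragment index in A, fragment index in B, score)


def best_chain(matches: List[Match], span: int = 5) -> Tuple[float, List[Match]]:
    """Return the chain with the highest cumulative score such that consecutive
    matches are at least `span` indices apart in both A and B, so that the
    fragments are in increasing order and do not overlap."""
    if not matches:
        return 0.0, []
    ordered = sorted(matches)            # increasing A index, then B index
    best = [m[2] for m in ordered]       # best cumulative score ending at each match
    prev = [-1] * len(ordered)           # back-pointers for chain recovery
    for i, (ai, bi, si) in enumerate(ordered):
        for j in range(i):
            aj, bj, _ = ordered[j]
            if ai - aj >= span and bi - bj >= span and best[j] + si > best[i]:
                best[i] = best[j] + si
                prev[i] = j
    end = max(range(len(ordered)), key=best.__getitem__)
    chain = []
    i = end
    while i != -1:
        chain.append(ordered[i])
        i = prev[i]
    return best[end], chain[::-1]


if __name__ == "__main__":
    # Illustrative scores only; the index pairs echo the form of a Table 1 entry.
    demo = [(69, 199, 60.0), (74, 204, 62.0), (71, 201, 90.0)]
    print(best_chain(demo))
```

Dividing the cumulative score by the number of matches in the chain gives an average, so the same chain can be reported under either sorting format.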

Discussion 

Genetic evolution over large time spans often leaves little trace of
kinship in different organisms, even when the functional roles of the genes
remain conserved. A relevant example is the BRCA2 gene which,
although present in many eukaryotes, is weakly conserved. The
BRCA2 protein plays a major role in maintaining genomic stability and
fostering genetic variability, and also has other cellular functions.
Individuals with germline mutations in the BRCA2 gene are at
significantly greater risk of a wide range of cancers. This is supposed to
be primarily due to the instability in chromosome structure and number
induced by functional aberrations in BRCA2. Conserved fragments
of ~30 amino acids (BRC repeats) that mediate the interaction of
BRCA2 with the RAD51 recombinase have been instrumental in
identifying BRCA2 orthologs in other species. The BRCA2 protein
in the Drosophila genus assumes significance in this context owing to
the advanced tools available for Drosophila genetics, and has been
functionally characterized recently.

However, weak sequence conservation in this gene has proven to be
an impediment in associating experimentally proven functionally
relevant gene fragments in humans and Drosophila. The variability
in the number of BRC repeats even within the Drosophila species
has provided fodder for further speculation on the evolution of this
gene. The detection of a new BRC repeat would necessitate the
reevaluation of such hypotheses.

Apart from the BRC repeats, RAD51 interacts with BRCA2 in the
carboxy-terminal, and this interaction is modulated by checkpoint
kinases. Since the introduction of BRC repeats in the cell inhibits
the formation of RAD51 nucleoprotein filaments, a model has been
suggested whereby RAD51 binds to both the BRC repeats and the
carboxy-terminal in undamaged cells, and DNA damage triggers the
release of the carboxy-terminal bound RAD51 via the phosphorylation
of a threonine residue.

Thus, it is noted that certain functionally significant domains are much
more conserved compared to the complete protein. In the current
work, a methodology to annotate proteins in such 'twilight' zones by
fragmenting and aligning two protein sequences (Figure 1) has been
presented. The results are sorted based on differing criteria, and
can be directed by an input file in case the sequences have already
been annotated. This method helps in quickly homing in on
conserved sites through visual inspection (Table 1 and Figure 1). The
threonine phosphorylation site (E=9.1e-4) for checkpoint kinases
(Chk1/Chk2) (Figure 1) and a new BRC repeat (E=17e-4) (Figure 2)
have been identified using FRAGAL. Pruning out matches
which do not have a corresponding conserved sequence in hBRCA2
helps us to select fragment 61 in dmBRCA2 as a new BRC
repeat, and fragment 91 in dmBRCA2 as the putative threonine
site for phosphorylation by checkpoint kinases Chk1 and Chk2. It
must be noted that the sites identified remain putative until verified
by experimental data, in spite of the low E-values obtained.

Figure 2. New BRC repeat identified by FRAGAL in Drosophila melanogaster. (a) The multiple alignment for this new sequence
(634-664:LDTALKRSIESSEEMRSKASKLVVVDTTMR) (using ClustalW), highlighted as melanogaster4, and other sequences compared
previously. This putative sequence is more closely related to the sequences in D. willistoni than other members of the genus. (b) The
phylogenetic tree (using PHYML) gives a graphical representation of the relation of the various repeats in the Drosophila genus, corroborating
the closer relation of the new BRC repeat to D. willistoni. (c) Alignment of the new BRC repeat to the sequence logo of the Prosite BRCA2
repeat profile PS50138.



The multiple alignments can be used to create (for the carboxy-
terminal phosphorylation threonine site) or extend (for the new
BRC repeat) Prosite motifs. However, the carboxy-terminal
phosphorylation threonine site Prosite motif generated from the
multiple alignment of sequences from Drosophila and mammals
did not result in any matches in other organisms (Ustilago maydis
and Caenorhabditis elegans).

Figure 3. New Half-a-tetratricopeptide (HAT) motif identified by FRAGAL in Saccharomyces cerevisiae. (a) Pairwise alignment of a
previously unannotated HAT motif in S. cerevisiae (E=11e-4, Z-score=116) (red for identity, green for similarity). (b) The multiple alignment for
this new sequence (using ClustalW) with other HAT motifs in S. cerevisiae and humans shows large variation amongst HAT sequences even
within the same organism. The conserved spacing of the aromatic residues is also highlighted. (c) The phylogenetic tree (using PHYML)
shows that certain HAT repeats are more similar to HAT repeats from other organisms than to other HAT repeats in their own sequence.

In order to justify this method further, I concentrated on proteins
that contain the Half-a-tetratricopeptide (HAT) repeat motifs. The
HAT motif is much less ubiquitous than the related tetratricopeptide
(TPR) repeat, and has been implicated in various aspects of RNA
metabolism. HAT motifs are also hypothesized to play a
critical role in assembling RNA-processing complexes. A recent study
that combined bioinformatics, modeling and mutagenesis
studies of the HAT domain used the three tandem HAT motifs in the
Saccharomyces cerevisiae protein Utp6 to make inferences about
the residues that confer structural and/or functional properties to
the motif. In the current work, the detection of a new HAT repeat
sequence (E=11e-4) in Utp6 from S. cerevisiae has been reported.
This sequence has the desired aromatic residues at the proper
spacing, a requisite for being considered a HAT repeat. The above
mentioned study would have gained further by the knowledge of
this HAT repeat, a repeat that remained undetected by sequence
analysis using other methods. The HAT repeats are much more
varied, and thus not suitable for generating motifs (like Prosite).
For example, the consensus sequence has been derived from an
alignment of 742 HAT motifs from Pfam, and had to be manually
edited since this alignment included gaps in greater than 90% of
the sequences. Moreover, FRAGAL detects that a particular HAT
sequence in one protein is more related to HAT sequences from
other species than to other HAT repeats present in its own sequence.
This raises interesting questions about their evolutionary history.

In some of the significant matches in Table 1 the fragment in 
hBRCA2 is not annotated to be functionally relevant - for exam- 
ple fragments 33 and 87 of dmBRCA2 and fragments 176 and 
194 in hBRCA2, respectively. These fragments might suggest an 
important, yet unknown, functional relevance of that stretch of the 
human gene as well, since it is conserved across distant species. 



An excellent database for Drosophila related information is
available at http://flybase.org/. A database (www.sanchak.com/fragal.html)
for BRCA2 and nucleolar RNA-associated proteins from
different organisms has been created, and will be updated on a regular
basis to include more organisms and different repeats. The
increasing importance of Drosophila as a model system for cancer
research in the search for human therapeutics can be exploited
to the hilt once the conserved mechanism is fully understood.
FRAGAL presents the first step by annotating putative conserved
sequence fragments in Drosophila and humans.

Figure S1. FRAGAL methodology. The BRCA2 protein sequences from Drosophila melanogaster and humans are split into fragments of
parameterized length (50 in this case), at a parameterized interval (10 in this case). All pairs of fragments are aligned, and the results stitched
such that there is no overlap in any given match and the order of the match is not interspersed. The alignment is done using the global
alignment program 'needle' from the EMBOSS suite, and the gap penalties are set to 25 to discourage gaps in the middle of the alignment.

Figure S2. Annotating the hBRCA2 sequence. The syntax is similar to the one used in the UniProt accession site.

Figure S3. New BRC repeat identified by FRAGAL in Drosophila melanogaster aligned using MAFFT. (a) The multiple alignment, done
using MAFFT, for this new sequence (634-664:LDTALKRSIESSEEMRSKASKLVVVDTTMR), marked as melanogaster4, and other sequences
compared previously. This putative sequence is more closely related to the
sequences in D. willistoni than other members of the genus, as also shown by the alignment done using ClustalW. (b) The phylogenetic tree
corroborates the closer relation of the new BRC repeat to D. willistoni, similar to the ClustalW results.



F1000Research 2013, 2:143 Last updated: 10 FEB 2014



[Figure S4(a), multiple sequence alignment: HAT repeats labelled sp|Q9NYH9 and sp|Q02354, together with repeats from Q5BDX1 (Aspergillus) and Q6FU45 (CAN), including the new repeat sp|Q02354|30-62-NEW.]



Figure S4. Including more proteins having the HAT motif. (a) The multiple alignment for this new sequence (using ClustalW) with other
HAT motifs in S. cerevisiae, humans, Aspergillus nidulans and Candida glabrata shows large variation amongst HAT sequences even within
the same organism. The conserved spacing of the aromatic residues remains the same. (b) Similar to the case when fewer organisms were
included, the phylogenetic tree (using PHYML) shows that certain HAT repeats are more similar to HAT repeats from other organisms than
to other HAT repeats in their own sequence.
In this well-written manuscript, Chakraborty presents a tool for local alignment of two protein sequences
that includes a fragment-chaining step. He then uses this tool to identify important putatively functional
fragments in two different Drosophila proteins by comparison to the respective human ortholog. A
database containing results of many more similar applications is also presented and is a nice aspect of
the work.

I have the following specific comments, mostly related to the presentation, that might help the author
improve the clarity of the manuscript.

The current title is rather long. The contribution of this work is mainly in the form of the FRAGAL tool, and 
the title could be trimmed to emphasize only that. The two example applications to finding the 
phosphorylation site, BRC repeat etc. are not experimentally substantiated biological claims, and may be 
better off being left out of the title. 

Clear discussion should be provided regarding other previous work where pairs of aligned fragments are 
stitched together (e.g. exon chaining, see Jones & Pevzner 2004, and the chain/net approach to whole 
genome alignments). 

Since the FRScore does not include gap penalties, I am assuming that each pair of fragments is 
subjected to two distinct similarity-scoring approaches; the gap-based approach when aligning that pair of 
fragments using 'needle' and the match/mismatch based approach when ranking the aligned pairs. This 
should be stated clearly, to avoid confusion. Is there a reason why the needle score was not used in place 
of the FRScore? 

It appears that the FR score is the unweighted sum of %similarity and %identity. This should be stated 
explicitly. 

I did not quite understand the formatting of Table 1.

• I think there should be a line separating 'A' from 'B' (which I think comes after a row with rank 14 in
the second column). It took me some time to see that there are two sub-tables being shown here.

• Also, the fact that the same row is used to show different entities was confusing; usually a table is
constructed so that a row shows different pieces of information about one entity.

• I assume something like A91-B337 refers to the starting positions of a matching fragment between
sequences A and B, and the length of that fragment is not indicated in the row. Is this correct? (On
reading further I realize that this interpretation is incorrect, and the numbers in a match are arbitrary
indices and not coordinates. This was not clear from the legend.)

• I found that presenting sub-tables 'A' and 'B' (which I finally realized does not relate to 'A' and 'B' 
sequences) leads to more confusion than it helps. If both sub-tables present the same ranked list 
and the only difference is that 'A' filters for "conserved" fragments, then it might be better to show 
only sub-table 'B' and add a column indicating if this is a fragment marked conserved. 

Where is the E-value of an FRScore coming from? (I am not familiar with what the 'Protein Information
Resource' provides.) Perhaps this E-value corresponds to the global alignment score reported by needle?

The results of Table 1 do not aid one's understanding as to how the fragmented alignment, i.e. stitching
together of fragments, helps in this case. As shown, this appears similar in form to a ranked list of
matches from a standard local-aligner. Similarly, with respect to Figure 2, the author may wish to discuss
why FRAGAL finds the 'melanogaster4' fragment as a BRC repeat where previous annotations (that found
three repeats) failed. Was this a matter of previous methods 'missing the threshold'? (This seems unlikely
given the strong E-value reported for this.) Similar clarifications for the HAT repeat finding exercise will
also be helpful.
I have read this submission. I believe that I have an appropriate level of expertise to confirm that
it is of an acceptable scientific standard.

Competing Interests: No competing interests were disclosed. 



1 Comment 

Author Response 

Sandeep Chakraborty, Tata Institute of Fundamental Research, India 
Posted: 17 Sep 2013

I greatly appreciate the positive comments. 

The current title is rather long. The contribution of this work is mainly in the form of the FRAGAL
tool, and the title could be trimmed to emphasize only that. The two example applications to finding
the phosphorylation site, BRC repeat etc. are not experimentally substantiated biological claims,
and may be better off being left out of the title.

Since this work began as a search for unannotated fragments of the dmBRCA2 sequence, and is
being actively followed up in the lab, I think the reference to an application of FRAGAL is warranted.
However, I have removed the reference to the HAT repeat and stated in the title that the
phosphorylation site and the BRC repeat are putative.

Clear discussion should be provided regarding other previous work where pairs of aligned 
fragments are stitched together... 

I have discussed the spliced alignment approach to genome assembly in the discussion. However, 
I have noted that while these methods use graph algorithms to solve the computationally difficult 
problem of exon chaining, FRAGAL does the converse by finding best matches in known exon 
chains (i.e. protein sequences). 

Since the FRScore does not include gap penalties, I am assuming that each pair of fragments is 
subjected to two distinct similarity-scoring approaches; the gap-based approach when aligning that 
pair of fragments using 'needle' and the match/mismatch based approach when ranking the 
aligned pairs. This should be stated clearly, to avoid confusion. Is there a reason why the needle 
score was not used in place of the FRScore? 

The needle score includes gap penalties, which are not intended for use in
FRScore, as you have correctly pointed out. The idea is to direct the alignment to discourage gaps
- but once the alignment is done a gap should not have a penalty. It is 'real', and therefore only the
identity or similarity matters.

It appears that the FR score is the unweighted sum of %similarity and %identity. This should be 
stated explicitly. 

I have empirically assigned more weightage to the %identity based on the fact that we are
searching for repeats, and expect higher conservation. This conservation is magnified a bit more
by assigning higher weightage.
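
The effect of weighting %identity above %similarity can be sketched as below. This is only an illustration under stated assumptions: the function names are hypothetical, and the default weights (1 for %identity, 1/3 for %similarity) are placeholders suggested by the referee's remark about a weight of 1/3 on similarity, not a confirmed transcription of equation 2 in the manuscript.

```python
def fr_score(pct_identity, pct_similarity,
             w_identity=1.0, w_similarity=1.0 / 3.0):
    """Weighted sum of %identity and %similarity (weights are assumptions)."""
    return w_identity * pct_identity + w_similarity * pct_similarity


def rank_matches(matches, **weights):
    """Sort (label, pct_identity, pct_similarity) tuples, best score first."""
    return sorted(matches,
                  key=lambda m: fr_score(m[1], m[2], **weights),
                  reverse=True)


# Two made-up matches: one similar but weakly identical, one more identical.
example = [("x", 20.0, 80.0), ("y", 40.0, 50.0)]
weighted = rank_matches(example)                      # favours identity
unweighted = rank_matches(example, w_similarity=1.0)  # plain sum
```

With these illustrative numbers, the identity-weighted ranking places the more identical match first, whereas the plain sum would prefer the match with higher similarity.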

I did not quite understand the formatting of Table 1. ... I found that presenting sub-tables A and B
(which I finally realized does not relate to A and B sequences) leads to more confusion than it
helps.

I apologize for this confusion. The line demarcating subtables A and B was lost in the typesetting,
and I failed to detect this error. Further, naming the subtables A and B was a poor choice,
since the sequences were also named A and B. Finally, I agree that the second column
was unnecessary, as were the two subtables. I have simplified this table.

I assume something like A91-B337 refers to the starting positions of a matching fragment between
sequences A and B, and the length of that fragment is not indicated in the row. Is this correct? (On
reading further I realize that this interpretation is incorrect, and the numbers in a match are arbitrary
indices and not coordinates. This was not clear from the legend.)

I apologize for this oversight. This is mentioned in the web pages -

http://sanchak.com/fragal/ALLRUNS.BRCA2/Caenorhabditiselegans.G5EG86.Homosapiens.P51587.(
as 'Multiply index with 10 to get sequence starting position in original sequence'. Thus A91 refers
to the sequence starting at 910 in 'A' and going till 959, since the fragmenting length is 50. I have
now mentioned this at the beginning of the Results section, and in the legend.
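
The index-to-coordinate rule quoted above can be written out as a small helper. This is a sketch under stated assumptions: the name `match_to_ranges` is hypothetical, the step of 10 and fragment length of 50 are the values given in this response, and the returned end is nominal (a fragment at the very end of a sequence may in practice be shorter).

```python
import re


def match_to_ranges(label, step=10, length=50):
    """Decode a match label such as 'A91-B337' into per-sequence ranges.

    Each index is multiplied by `step` to give the starting position, and
    the inclusive end is start + length - 1, so 'A91' maps to (910, 959).
    """
    ranges = {}
    for name, index in re.findall(r"([A-Za-z])(\d+)", label):
        start = int(index) * step
        ranges[name] = (start, start + length - 1)
    return ranges


ranges = match_to_ranges("A91-B337")
```

For the label discussed here, the 'A' entry is (910, 959), matching the worked example in the response.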

Where is the E-value of an FRScore coming from? ... Perhaps this E-value corresponds to the 
global alignment score reported by needle? 

I have specified the website (http://pir.georgetown.edu/pirwww/search/pairwise.shtm), and cited
the paper by Wu C et al. (2003) The Protein Information Resource. Nucleic Acids Res 31: 345-347.

... with respect to Figure 2, the author may wish to discuss why FRAGAL finds the 'melanogaster4' 
fragment as a BRC repeat where previous annotations (that found three repeats) failed. ... Similar 
clarifications for the HAT repeat finding exercise will also be helpful. 

I could only make an educated guess as to why other tools failed to detect these repeats. I believe
that the tools used had a 'sequential' methodology and therefore one match fixed the order of the
next searches. Not all known BRC repeats have a low E-value when aligned with the new BRC
repeat. For example, the first BRC repeat in hBRCA2 when aligned to the new dmBRCA2 repeat
has an E=0.04, much higher than the E=17e-4 observed for the fourth repeat (which is the one I
report here). Ideally, if one took all the BRC repeats and did a search in the dmBRCA2 sequence,
this new repeat would be reported. Essentially, this is what FRAGAL does, albeit implicitly, by
automatically fragmenting the sequence. The same logic applies to the HAT repeat, where the
sequences are more varied and thus the choice of the repeat would affect the detection of new
motifs.

Competing Interests: No competing interests were disclosed. 
Approved with reservations: 22 August 2013
Referee Report: 22 August 2013 

The author presents an interesting technique for detecting new BRC repeats. The paper is generally well
written, but needs some additional material to bolster its case. The introduction section needs more
discussion of the 'state-of-the-art' in the alignment and motif detection area (especially with respect to
detecting BRC repeats). The paper's argument can be made stronger by explicitly mentioning the
advantages of fragmented alignment over any other recursively applied local alignment method or
homology search method. Some discussion of existing methods exists in the discussion/results section,
but that needs to be made available in the introduction in order to justify the need for creating a new
method. An explanation of the choice of parameter values needs to be given. For example, why does
FRscore have weights of 1/3 for only Similarity (equation 2)? Also, the reasoning for choosing particular
values for gap open and gap extend penalties needs to be mentioned along with whether parameter
tuning/search was done to arrive at those values. Otherwise, those numbers seem arbitrary. The author
needs to discuss whether he tried any other alignment algorithms apart from Clustal-W. Some other
algorithms such as MAFFT, ProbCons or CONTRAlign may yield better results. The author may want to
discuss the data in a separate section under methods/materials. A figure describing the FRAGAL pipeline
will be useful to visually describe the pipeline/algorithm.

I have read this submission. I believe that I have an appropriate level of expertise to confirm that
it is of an acceptable scientific standard, however I have significant reservations, as outlined
above.

Competing Interests: No competing interests were disclosed. 



2 Comments 



Author Response 

Sandeep Chakraborty, Tata Institute of Fundamental Research, India 
Posted: 23 Aug 2013 

Dear Dr Chikkagoudar, 

I would like to thank you for your insightful suggestions which will help improve the manuscript. I 
will make the suggested changes, and incorporate them in a new version shortly. 

A small clarification - you have asked for a "figure describing the FRAGAL pipeline". There is a
supplementary figure S1 doing this. Do you think that this figure is insufficient, or were you
suggesting that I move this to the main manuscript?

Best regards, 

Sandeep 

Competing Interests: No competing interests were disclosed. 
Author Response

Sandeep Chakraborty, Tata Institute of Fundamental Research, India
Posted: 17 Sep 2013

I appreciate the positive comments, and hope to have addressed the concerns detailed below. 

The introduction section needs more discussion of the 'state-of-the-art' in the alignment and motif
detection area (especially with respect to detecting BRC repeats) ... Some discussion of existing
methods exists in the discussion/results section, but that needs to be made available in the
introduction in order to justify the need for creating a new method.

I have moved this section to the introduction. In response to the comments of another reviewer (Dr 
Saurabh Sinha), I have also mentioned the possible reasons why existing tools have failed to 
detect these repeats. Further, I have cited methods that use spliced alignment methods for 
genome assembly, noting that these methods use graph algorithms to solve the computationally 
difficult problem of exon chaining. FRAGAL does the converse by finding best matches in known 
exon chains (i.e. protein sequences). 

An explanation of the choice of parameter values needs to be given. For example, why does
FRscore have weights of 1/3 for only Similarity (equation 2)?

The extra weightage given to % identity in the score is due to the fact that one expects more 
sequence conservation in repeats. 

...the reasoning for choosing particular values for gap open and gap extend penalties needs to be
mentioned along with whether parameter tuning/search was done to arrive at those values.

The gap penalties are set to discourage the opening of gaps, but not gap extensions. The gap opening
penalty has been set to two values - 10 and 25. Results using both values have been uploaded in the database
http://sanchak.com/fragal/BRCA2.html. The results using 25 have been observed to be better.
However, a complete statistical analysis of these values is beyond the scope of this work.
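
A run with the two gap opening penalties mentioned above could be set up roughly as follows. This sketch only builds the command lines for the EMBOSS 'needle' program. The helper name `needle_command`, the file names, and the gap extension penalty of 0.5 are assumptions made for illustration; the response does not state the extension penalty that was used.

```python
def needle_command(a_fasta, b_fasta, out_path, gap_open=25.0, gap_extend=0.5):
    """Build an EMBOSS needle command line for one pair of fragments.

    gap_extend=0.5 is an assumed value, not one taken from the manuscript.
    """
    return ["needle",
            "-asequence", a_fasta,
            "-bsequence", b_fasta,
            "-gapopen", str(gap_open),
            "-gapextend", str(gap_extend),
            "-outfile", out_path,
            "-auto"]  # suppress interactive prompts


# One command per gap opening penalty discussed in the response.
commands = [needle_command("fragA.fa", "fragB.fa", f"out_gap{g}.needle",
                           gap_open=g)
            for g in (10, 25)]
```

Each command could then be executed with `subprocess.run(cmd, check=True)` on a machine where EMBOSS is installed.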

The author needs to discuss whether he tried any other alignment algorithms apart from Clustal-W.
Some other algorithms such as MAFFT, ProbCons or CONTRAlign may yield better results.

I have used another tool (MAFFT) to generate the multiple sequence alignment. The results from
the new alignment tool mirror the inference drawn from Clustal-W. This is now a supplementary
figure.

A figure describing the FRAGAL pipeline will be useful to visually describe the pipeline/algorithm.

I have added pseudocode for the FRAGAL program in the main manuscript.

Competing Interests: No competing interests were disclosed.
Approved: 12 August 2013
Referee Report: 12 August 2013 

A well written paper that proposes a new method, FRAGAL, for identifying putative functional motifs within
protein sequences which have been hidden from previous analyses. By splitting the sequences into
overlapping fragments, this method is able to discover additional motifs. The author has tested his
technique on BRC repeats in Drosophila dmBRCA2 and HAT repeats in budding yeast Utp6, by
comparing them to corresponding human protein sequences. The BRC repeat has been well analysed
with comparisons across several Drosophila species. However the author does not provide extensive
comparison of HAT repeats in Saccharomyces species. Since the sequences of several Saccharomyces
sibling species and closely related fungi such as Aspergillus, Candida, etc. are known, it would be
interesting to see how conserved this new HAT repeat is within the overall conservation of Utp6.

While the author establishes the advantage of FRAGAL technique, it is too early to say that this is a useful 
generic tool to identify known and novel motifs in protein sequences. I would request the author to run his 
FRAGAL code on several protein sequences with small motifs to estimate success rates and false 
discovery rates of his method. A supplementary table should be provided describing several sequences 
analysed by this method and these rates. 

A minor comment. Table 1 should be simplified with the two BRCA2 protein sequences presented in two 
sub-tables. Please explain why certain ranks are missing in FRscore and %S. 

I have read this submission. I believe that I have an appropriate level of expertise to confirm that
it is of an acceptable scientific standard.

Competing Interests: Although I am affiliated with the same institution as the author, I was not involved 
at any stage of manuscript preparation and did not collaborate with him at the time the review was written. 

2 Comments 



Author Response 

Sandeep Chakraborty, Tata Institute of Fundamental Research, India 
Posted: 23 Aug 2013 

Dear Dr Sinha, 

I greatly appreciate your comments on my manuscript. I will incorporate your suggested changes 
and update the manuscript. Your suggestion of including more of such repeats (I plan to do BIR 
and TPR) is something that will take some computational time and thus the delay. 

Best regards. 
Author Response

Sandeep Chakraborty, Tata Institute of Fundamental Research, India
Posted: 17 Sep 2013

I am grateful for the encouraging comments on the work. 

...author does not provide extensive comparison of HAT repeats in Saccharomyces species. ...it
would be interesting to see how conserved this new HAT repeat is within the overall conservation
of Utp6.

I have implemented this interesting idea using proteins which have the HAT repeat from
Aspergillus nidulans and Candida glabrata. This is now a Supplementary figure. However, these do
not provide any further insights into the evolution of the HAT repeat; that would require
sophisticated analyses beyond my expertise.

I would request the author to run his FRAGAL code on several protein sequences with small motifs
to estimate success rates and false discovery rates of his method. A supplementary table should
be provided describing several sequences analysed by this method and these rates.

In accordance with this suggestion, I have run FRAGAL on two more motifs (BIR and TPR). 
However, I failed to detect any new repeats using these two motifs. These are now part of the 
database - http://sanchak.com/fragal.html. 

A minor comment. Table 1 should be simplified with the two BRCA2 protein sequences presented
in two sub-tables. Please explain why certain ranks are missing in FRscore...

I have simplified the table considerably based on the comments of another reviewer (please see
below). The naming of the sub-tables as A and B was confusing given that the query and target
sequences were named A and B. Further, the columns for the similarity scoring have been removed.
We did not ever use the similarity-only score, and this was adding to the confusion. I have now
clearly stated the reason for some missing ranks. I apologize for the confusing aspects of this
table.

Competing Interests: No competing interests were disclosed. 